University of Konstanz – Applied Visual Analytics

VAST 2010 Challenge
Hospitalization Records -  Characterization of Pandemic Spread

Authors and Affiliations:

Andrada Stefana Astefanoaie, University of Konstanz, Andrada.Astefanoaie@uni-konstanz.de 
Rodica Bozianu, University of Konstanz, Rodica.Bozianu@uni-konstanz.de
Roland Jungnickel, University of Konstanz, Roland.Jungnickel@uni-konstanz.de

 

Dr. Peter Bak, University of Konstanz, Peter.Bak@uni-konstanz.de

Tool(s):

For the analysis of the data we used the data mining tool KNIME. KNIME is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models. We also used R and some self-written Java programs. For the visualization of the extracted data we used IBM's Many Eyes and the Protovis toolkit. Protovis composes custom views of data with simple marks such as bars and dots. It uses JavaScript and SVG for web-native visualizations.

 

Video:

 

http://ava.dbvis.de/MC2/AVA-final-video.mp4

 

 

ANSWERS:


MC2.1: Analyze the records you have been given to characterize the spread of the disease.  You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease.  Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak.  They are looking for visualization tools that will save them analysis time so they can react quickly.

 

1.       Introduction

In Mini Challenge 2 the task was to identify and characterize the spread of an epidemic outbreak. At some steps we used visualizations in order to extract the required information from the provided data.

For the analysis of the data we used the Konstanz Information Miner (KNIME) [1] data mining tool, R [2] and some self-written Java programs. For the visualization of the extracted data we used IBM's Many Eyes [4] and the Protovis [3] toolkit.

2.       Analytic Pipeline

To find the answers for both questions of the Mini Challenge we designed an analytic pipeline which combines automatic and semi-automatic data analysis and interactive visual explorations.

 

analytic-pipeline.png

Figure 1.   The analytic pipeline represents the workflow that was used to extract the information needed to answer the Mini Challenge.

The analytic pipeline is divided into three parts:

1.       preparation of data

2.       information extraction

3.       result

 

2.1. Preparation of data

2.1.1. Initial Analysis

 

The first part of the analytic pipeline, common to both questions of Mini Challenge 2, comprises the initial analysis and the data preprocessing. The raw input data consists of CSV files containing data for 11 locations. For each location we were supplied with two CSV files, one with hospitalized patients and one with the dates of death for patients who died.

 

From the initial analysis we concluded that the average mortality rate over all hospitalized patients is 2.45%. Another observation was that affected males and females are equally distributed. From this we concluded that the disease does not depend on gender.

 

2.1.2. Data Preprocessing

The first preprocessing activity was to merge the two patient tables. The second was to clean the values in the symptoms column. This process was done in multiple steps. First, the records with more than one symptom were comma separated. Second, we replaced abbreviations and duplicate symptoms in order to have only one term for each specific symptom.
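
The two preprocessing steps can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the KNIME workflow or the Java programs actually used: the field names (id, date_of_death), the extra separators handled besides the comma, and the entries of SYNONYMS are hypothetical, since the replacement table was built manually and is not listed in this report.

```python
import re

# Hypothetical replacement table: the actual abbreviations found in the data
# are not listed in this report.
SYNONYMS = {
    "abd pain": "abdominal pain",
    "nosebleed": "nose bleed",
    "vomitting": "vomiting",
}


def merge_records(admissions, deaths):
    """Attach a date of death (or None) to every hospitalization record.

    Both inputs are lists of dicts; the keys "id" and "date_of_death" are
    assumed names, not taken from the challenge files.
    """
    death_date_by_id = {row["id"]: row["date_of_death"] for row in deaths}
    return [
        {**row, "date_of_death": death_date_by_id.get(row["id"])}
        for row in admissions
    ]


def normalize_symptoms(raw):
    """Split a free-text symptom field and map each part to one canonical term."""
    # Split on commas, semicolons and the word "and" (assumed separators).
    parts = re.split(r"\s*(?:,|;|\band\b)\s*", raw.lower())
    cleaned = []
    for part in parts:
        term = SYNONYMS.get(part.strip(), part.strip())
        if term and term not in cleaned:  # drop empties and duplicates
            cleaned.append(term)
    return cleaned
```

Keeping the replacement table as plain data means that the manual part of the work, deciding which spellings denote the same symptom, stays separate from the code that applies it.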

The initial analysis and data preprocessing took about 20 hours (8 hours to manually replace abbreviations, 12 hours to programmatically generate necessary data).

 

2.2.  Information extraction - selection and visualization

The Information Extraction phase is subdivided into two parts, data reduction and visualization. The Information Extraction was done with the aforementioned tools and took about 10 hours.

To extract only the symptoms that characterize the disease, we counted the number of occurrences of each symptom on each day, for both hospitalizations and deaths. Afterwards, we calculated the mortality rate for each symptom. From this data, the characterizing symptoms were identified with the help of the treemap presented in Figure 2.
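
The counting step can be illustrated with a short Python sketch. It assumes merged records with the hypothetical fields symptoms (a list of cleaned terms), admitted (the day of hospitalization) and date_of_death (None for survivors), and it attributes each death to the day of hospitalization, which is an assumption and not necessarily how the original programs counted.

```python
from collections import defaultdict


def daily_symptom_counts(records):
    """Count hospitalizations and deaths per (symptom, admission day).

    Returns {(symptom, day): [hospitalized, died]}.
    """
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        for symptom in rec["symptoms"]:
            cell = counts[(symptom, rec["admitted"])]
            cell[0] += 1
            if rec["date_of_death"] is not None:
                cell[1] += 1
    return counts


def mortality_by_symptom(counts):
    """Aggregate the daily counts into a mortality rate (in %) per symptom."""
    totals = defaultdict(lambda: [0, 0])
    for (symptom, _day), (hospitalized, died) in counts.items():
        totals[symptom][0] += hospitalized
        totals[symptom][1] += died
    return {s: 100.0 * died / hosp for s, (hosp, died) in totals.items()}


def frequent_and_deadly(counts, min_rate_percent=1.0):
    """Keep symptoms whose mortality rate exceeds the threshold (> 1% in Figure 2)."""
    rates = mortality_by_symptom(counts)
    return {s: r for s, r in rates.items() if r > min_rate_percent}
```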

 

treemap

Figure 2.   Treemap representing the symptoms with the highest occurrence and a mortality rate > 1% for each location. Location is encoded by color hue, occurrence of a symptom by area and mortality rate by color saturation.

 

In a confirmatory step, the correlation of the symptoms was calculated and visualized with the Arc Graph presented in Figure 3. The visualization shows that some symptoms aren’t correlated with the main symptoms, but it cannot be clearly stated that these symptoms have no relation with the disease because of the high occurrence.

 

arcgraph

Figure 3.   Arc Graph - shows the connection between the most common symptoms. The size of each node shows the number of occurrences of the symptoms and the weight of the arc represents the number of records with both symptoms
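
The two quantities encoded in the Arc Graph, node size and arc weight, can be computed as in the following Python sketch. It assumes that every record has already been reduced to a list of cleaned symptom terms; the function name is illustrative and not taken from the original programs.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records):
    """Return (node_size, arc_weight) for a list of symptom lists.

    node_size[s]       - number of records mentioning symptom s
    arc_weight[(a, b)] - number of records mentioning both a and b (a < b)
    """
    node_size = Counter()
    arc_weight = Counter()
    for symptoms in records:
        unique = sorted(set(symptoms))  # count each symptom once per record
        node_size.update(unique)
        arc_weight.update(combinations(unique, 2))
    return node_size, arc_weight
```

Sorting the symptoms before pairing them ensures that each unordered pair is always stored under the same key.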

To gain more insight in the temporal patterns of the disease, a Stack hierarchy was used. To keep the visualization readily comprehensible, we focused on the symptoms discovered with the help of the treemap.

stack-hierachy

Figure 4.   Stack Graph - focuses only on the main symptoms. The number of occurrences is represented by vertical thickness. The x-axis represents time.

 

The Stack hierarchy reveals that the most affected locations are Aleppo, Karachi and Nairobi. In the second question of the Mini Challenge, we confirm this fact by using other visualizations. 

With this visualization it is possible to see different aspects of the data, such as the total number of people that died, and the number of people that recovered for a given symptom in each location.

 

2.3.  Results

Results were extracted by analyzing the visualizations that were created. Each visualization helped to get intermediate results which determined the next steps of the analytic process. Extracting the results took about 7 hours total.

 

By interpreting the treemap, the characterizing symptoms were identified. They are divided into main symptoms - abdominal pain, nose bleed, vomiting, vomiting blood, diarrhea – and accompanying symptoms – back pain, fever, neck pain.

The onset of the outbreak occurred between April 24th and April 30th. The occurrence of the main symptoms increased suddenly at the beginning (from April 28th), then decreased briefly around May 6th, only to start increasing again before reaching the peak.

 

The peaks of the disease in the different locations are reached on the following dates (without Turkey and Thailand, which we show in the second question of the Mini Challenge to be unaffected):

1.       Kenya (Nairobi) – 16th May

2.       Syria (Aleppo) – 17th May

3.       Lebanon, Yemen and Pakistan (Karachi) - 19th May

4.       Venezuela, Saudi Arabia and Iran – 20th May

5.       Colombia – 21st May
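
The peak dates listed above were read from the visualizations. A programmatic counterpart could look like the following Python sketch, which assumes a hypothetical mapping from each location to its daily number of patients with the main symptoms, with days given as ISO date strings or date objects so that they sort chronologically.

```python
def peak_day(daily_counts):
    """Return the day with the highest count; the earliest day wins a tie."""
    if not daily_counts:
        raise ValueError("no daily counts given")
    return min(daily_counts, key=lambda day: (-daily_counts[day], day))


def peaks_by_location(counts_by_location):
    """Map each location to its peak day, ordered from earliest to latest peak."""
    peaks = {loc: peak_day(days) for loc, days in counts_by_location.items()}
    return dict(sorted(peaks.items(), key=lambda item: item[1]))
```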

 

Recovery begins after each peak, and full recovery is reached around the 11th June for all locations.

The average mortality rate is 6.762296%, which is higher than the initial average mortality rate that took into consideration all hospitalized people. Without taking Thailand and Turkey into consideration it increases to 7.36427%.

 

3.       Conclusion

By using existing tools it was possible to extract the information needed to answer the questions of the Mini Challenge. We discovered the main symptoms of the disease (abdominal pain, nose bleed, vomiting, vomiting blood, diarrhea) and an overall mortality rate of about 6.76%. The onset of the disease occurred between April 24th and April 30th, the peak was reached between May 16th and May 21st, and full recovery was reached around June 11th.

 

4.       References

[1]     KNIME (Konstanz Information Miner) - http://www.knime.org/

[2]     R - http://www.r-project.org/

[3]     Protovis - http://vis.stanford.edu/protovis/

[4]     Many Eyes - http://manyeyes.alphaworks.ibm.com/manyeyes/

 


MC2.2:  Compare the outbreak across cities.  Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities.  Identify any anomalies you found.

 

1.       Introduction

In Mini Challenge 2 the task was to identify and characterize the spread of an epidemic outbreak. At some steps we used visualizations in order to extract the required information from the provided data.

For the analysis of the data we used the data mining tool Konstanz Information Miner (KNIME) [1], R [2] and some self-written Java programs. For the visualization of the extracted data we used IBM's Many Eyes [4] and the Protovis [3] toolkit.

2.       Analytic Pipeline

To find the answers for both questions of the Mini Challenge we designed an analytic pipeline which combines automatic and semi-automatic data analysis and interactive visual explorations.

 

analytic-pipeline

Figure 1.   The analytic pipeline represents the workflow that was used to extract the information needed to answer the Mini Challenge.

The analytic pipeline is divided into three parts:

1.       preparation of data

2.       information extraction

3.       result

 

2.1. Preparation of data

2.1.1. Initial Analysis

 

The first part of the analytic pipeline, common to both Mini Challenge questions, is subdivided into an initial analysis of the data and the preprocessing. For the initial analysis we used the raw input data, which consists of CSV files for 11 locations. For each location we were supplied with two CSV files, one with hospitalized patients and one with the dates of death for patients who died.

From the initial analysis we concluded that the average mortality rate over all hospitalized patients is 2.45%. Another observation was that affected males and females are equally distributed. From this we concluded that the disease does not depend on gender.

 

2.1.2. Data Preprocessing

The first preprocessing activity was to merge the two patient tables. The second was to clean the values in the symptoms column. This process was done in multiple steps. First, the records with more than one symptom were comma separated. Second, we replaced abbreviations and duplicate symptoms in order to have only one term for each specific symptom.

The initial analysis and data preprocessing took about 20 hours (8 hours to manually replace abbreviations, 12 hours to programmatically generate necessary data).

 

2.2. Information extraction

The Information Extraction phase is subdivided into two parts, data reduction and visualization. The Information Extraction was done with the aforementioned tools and took about 10 hours.

2.2.1. Filtering

In order to filter the data, the number of occurrences of each symptom on each day was counted for both hospitalizations and deaths. Afterwards it was possible to calculate the mortality rate for each symptom by day.

 

2.2.2. Visualization

In the first question of the Mini Challenge the characterizing symptoms of the disease were identified with the help of the treemap in Figure 2.

 

treemap

Figure 2.   Treemap representing the symptoms with the highest occurrence and a mortality rate > 1% for each location. Location is encoded by color hue, occurrence of a symptom by area and mortality rate by color saturation.

 

Once the symptoms were identified, the outbreaks and the peaks were established according to the Small Multiples Chart in Figure 3.

 

small-multiples

Figure 3.   Small Multiples Chart - for each location, the first column shows the daily evolution of the number of people that died (taking into consideration the day of hospitalization and the day of death). The second column shows the total number of patients infected. The third column shows the evolution of the percentage of people that died.

 

Figure 3 shows the correlation between the number of people hospitalized, the number of deaths and the mortality rate.
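
In this report the correlation is judged visually from the chart. As a numeric counterpart, the Pearson correlation coefficient between two daily series of one location could be computed as in the following Python sketch; the function is illustrative and was not part of the original analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 for the daily hospitalizations and daily deaths of a location would support the visual impression that both series rise and fall together.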

 

2.3. Results

By visualizing the extracted information, we established that the outbreak of the disease started at almost the same time in each country, namely between April 24th and April 30th. The first two locations to experience the outbreak were Kenya (Nairobi) and Lebanon. The order of the outbreaks is not clearly discernible because, at the beginning of the outbreak, the number of patients in most cities varies strongly from one day to the next. In almost all cities the number of patients shows a small increase, then decreases and increases again. The order of the peaks can be easily observed:

 

1.       Kenya (Nairobi) – 16th May

2.       Syria (Aleppo) – 17th May

3.       Lebanon, Yemen and Pakistan (Karachi) - 19th May

4.       Venezuela, Saudi Arabia and Iran – 20th May

5.       Colombia – 21st May

 

These peaks were confirmed in the first question of the Mini Challenge.

 

Figure 3 reveals that Thailand and Turkey don’t have a clear peak, and the number of patients and deaths is fluctuating very strongly. From this observation we conclude that these countries were not affected by the disease.
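
The absence of a clear peak was judged visually. One possible way to make this observation checkable is to compare the highest daily count with the median daily count, as in the following Python sketch. Both the measure and the threshold of 3.0 are hypothetical choices for illustration and were not used in the original analysis.

```python
from statistics import median


def peak_to_median_ratio(daily_counts):
    """Ratio of the highest daily count to the median daily count."""
    values = list(daily_counts)
    if not values:
        raise ValueError("no daily counts given")
    mid = median(values)
    if mid == 0:
        return float("inf") if max(values) > 0 else 0.0
    return max(values) / mid


def has_clear_peak(daily_counts, threshold=3.0):
    """True if the maximum stands out from the typical day (assumed threshold)."""
    return peak_to_median_ratio(daily_counts) >= threshold
```

A series that only fluctuates around a constant level yields a ratio close to 1, while a series with a pronounced outbreak yields a much larger value.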

 

The World Map in Figure 4 was chosen to see if there is any correlation between the location of a country and the values and to compare the spread across cities.

The most affected country is Syria, being represented by Aleppo. The next one is Kenya (represented by Nairobi) with a very small difference.

world-map

Figure 4.   World Map - shows the average mortality rate per country, the recovery ability per country and the percentage of people infected per country.

 

Across all countries, on average about 30% of the hospitalized people were infected with the disease. The mortality rate in the given locations is about 10%.

These results were extracted in about 6 hours by interpreting the visualizations that were created.

 

3.       Conclusion

By using existing tools it was possible to extract the needed information to answer the questions of the Mini Challenge. We identified Nairobi as the first location to be infected and to reach the peak. Aleppo is the most affected location and the different locations have in general a recovery ability of about 90%. Turkey and Thailand might be considered as anomalies because they weren’t affected.  

 

4.       References

 

[1]     KNIME (Konstanz Information Miner) - http://www.knime.org/

[2]     R - http://www.r-project.org/

[3]     Protovis - http://vis.stanford.edu/protovis/

[4]     Many Eyes - http://manyeyes.alphaworks.ibm.com/manyeyes/